Named entity recognition in aerospace based on multi-feature fusion transformer

In recent years, artificial intelligence and aerospace have both developed rapidly, and combining the two is a clear future trend. As a basic tool of Natural Language Processing, Named Entity Recognition can help extract key knowledge from large volumes of aerospace data. In this paper, we produce a Chinese aerospace-domain entity recognition dataset containing 30 k sentences and develop a named entity recognition model, the Multi-Feature Fusion Transformer (MFT), which combines features such as words and radicals to enhance the semantic information of sentences. The model also uses a double Feed-forward Neural Network to further improve performance. We train MFT on our aerospace dataset. The experimental results show that MFT achieves strong entity recognition performance, with an F1 score of 86.10% on the aerospace dataset.


Related work
Aerospace named entity recognition is a domain-specific task, but it remains part of general named entity recognition research. Deep learning has advanced rapidly in recent years, and various named entity recognition methods based on deep learning have appeared. As one of the early deep learning models, LSTM was applied to the named entity recognition task by Hammerton 5. However, LSTM only extracts features in a sentence from a single direction. To solve this problem, Huang et al. combined BiLSTM with Conditional Random Fields (CRF) for the entity recognition task and achieved satisfactory results. In addition to temporal models that can be used for semantic modeling 6, Collobert et al. used Convolutional Neural Networks (CNN) as NER model encoders to model local semantic features of sentences, with CRF as the decoder to generate the corresponding labels 7. Dos Santos used an improved CNN model for Natural Language Processing (NLP), 1D-CNN, to recognize entities. Experiments show that this improvement is very effective 8. However, these methods only perform feature extraction on characters or words one by one. For this reason, Vaswani et al. proposed the Transformer model, based on a self-attention mechanism, which provides a new idea for named entity recognition. The method not only improves the recognition accuracy of the model but also reduces its training time 9. Dai et al. argue that the ability to model long-term dependencies is crucial to a language model and is a weakness of the Transformer, so they proposed the Transformer_XL model, which improves the modeling of long-term dependencies by 80%. However, Guo et al. argue that named entity recognition differs from other language modeling tasks and should pay more attention to local semantics 10. They propose a lightweight Star Transformer model. Experiments show that this model is more suitable for NER tasks.
Chinese named entity recognition methods are classified into character-based and word-based methods. Character-based approaches lose the word information in sentences, while word-based approaches are strongly influenced by the quality of the segmentation. Liu et al. discuss character-based and word-based approaches separately and conclude that character-based approaches are empirically the better choice 11. However, some researchers have tried to combine the two by incorporating lexicon information into a character-based approach. Gui et al. proposed the Lexicon Rethinking Convolutional Neural Network (LR-CNN), which uses a lexicon to help the model determine entity boundaries 12. Zhang et al. proposed Lattice LSTM, which reinforces semantic and entity boundary information by using a lexicon. Gui et al. proposed a Lexicon-Based Graph Neural Network (LGN), in which a graph neural network introduces the latent word information matched by the dictionary into the model to complete the entity recognition task 13. Li et al. proposed FLAT, which uses relative position encoding to recover lattice structure information. Because the flattened lattice is compatible with the Transformer, the performance of the model is further improved.
In terms of the structural features of Chinese characters, Dong et al. were the first to introduce the structural information of Chinese characters into an NER model, using Bi-LSTM to extract features from Chinese radicals; this method achieved the best performance on the MSRA dataset. Meng et al. used images of Chinese characters to assist NER, exploiting the stroke and structural features that the images carry 14.
There are also many named entity recognition works in the aerospace domain. Xu et al. crawled relevant texts from NASA's official website to produce a spacecraft named entity recognition dataset and used CRF to complete the entity recognition task 15. Boan Tong et al. used the book World Spacecraft Encyclopedia as the data source for a spacecraft-related dataset and applied transfer learning with the BERT-BiGRU-CRF model (BiGRU: Bidirectional Gated Recurrent Unit), fine-tuning the model parameters on the spacecraft-domain corpus to accomplish entity recognition in the spacecraft domain 16. Tikayat et al. developed an English-language aerospace dataset with which they fine-tuned BERT for better recognition performance.
Use the lexicon to match the Chinese sentence to the words w_1, w_2 and w_3; these Chinese words make it easier for the NER model to determine entity boundaries. Whether the entity in a sentence is w_1, w_2 or w_3 can be determined by the NER model using contextual semantics.

Aerospace dataset
Since there is no publicly available named entity recognition dataset in the aerospace domain, we use a crawler system to obtain relevant corpus from Internet websites such as Wikipedia, to the extent permitted by law, and use Label Studio for manual labelling. The resulting aerospace-domain dataset contains 29,953 sentences and 51,482 entities. The construction process of the aerospace dataset is shown in Fig. 2. First, we use a crawler based on the Scrapy framework to obtain aerospace data from Wikipedia and China Aerospace News, and then filter the corpus to remove content that is not relevant to the domain. After that, we split the corpus into sentences and ensure that each sentence contains at least two aerospace entities. Finally, the corpus is labeled in the BIO format with the help of Label Studio. An example of the BIO labeling format is shown in Fig. 3, where 'B' stands for 'Begin' and annotates the head of an entity, 'I' stands for 'Inside' and annotates the rest of the entity, and 'O' stands for 'Outside' and annotates non-entity tokens.
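The BIO scheme described above can be illustrated with a minimal sketch. The function name `bio_tags`, the span format (start, exclusive end, type) and the example sentence are assumptions made for illustration; they are not taken from the dataset or from Label Studio's export format.

```python
def bio_tags(chars, entities):
    """Return one BIO tag per character.

    `entities` is a list of (start, end, type) spans; `end` is exclusive.
    This span format is a hypothetical choice for the sketch.
    """
    tags = ["O"] * len(chars)              # everything is non-entity by default
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"         # head of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"         # rest of the entity
    return tags


# Hypothetical example: "长征五号发射" with one TOAV entity covering "长征五号".
chars = list("长征五号发射")
tags = bio_tags(chars, [(0, 4, "TOAV")])
for ch, tag in zip(chars, tags):
    print(ch, tag)
```

Character-level tagging is used here because the model described in this paper operates on Chinese characters rather than on pre-segmented words.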
Entities are categorized into seven types: aerospace companies and organizations (ACAO), airports and spacecraft launch sites (AASLS), types of aerospace vehicle (TOAV), constellations and satellites (CAS), space missions and projects (SMAP), scientists and astronauts (SAA), and aerospace technology and equipment (ATAE). In this paper, 80% of the data in the dataset is used to train the model, 10% to validate it and 10% to test it. The main information of the aerospace dataset is shown in Table 1.
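As a rough illustration of the 80/10/10 partition, the sketch below splits a list of sentences with integer arithmetic. The shuffle, the fixed seed and the function name `split_dataset` are assumptions; the paper does not state how sentences were assigned to each subset.

```python
import random


def split_dataset(sentences, seed=0):
    """Shuffle and split into train/dev/test with an 8:1:1 ratio.

    Shuffling with a fixed seed is an assumption of this sketch.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]          # remainder goes to the test set
    return train, dev, test


# 29,953 is the sentence count reported for the aerospace dataset.
train, dev, test = split_dataset(range(29953))
print(len(train), len(dev), len(test))
```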

Multi-feature fusion transformer
Word and radical information are very important features for Chinese characters, so in this paper we use MFT, which can fuse this information, as the named entity recognition model. The network structure of the MFT model is shown in Fig. 4. The model first extracts the radical embedding of Chinese characters through 1D-CNN, then fuses it with the lattice sequence embedding output by the Flat-Lattice module. The fused embedding is the input to the Double Feed-forward Multi-head Self-attention (DFMS) encoder module, and the corresponding label sequence is finally decoded by CRF. In the DFMS encoder module, MFT modifies the structure of the Transformer by adding a Feed-Forward Neural Network (FFN) before the multi-head self-attention module. This sandwich structure shows better performance in NER.

Flat-lattice module
Similar to the Flat-Lattice module in the FLAT model, the Flat-Lattice module in MFT uses a lexicon to match the input sentence, obtains the potential words contained in the sentence, and encodes the positions of the characters and these potential words in order to construct the lattice. The structure of the Flat-Lattice module is shown in Fig. 5. For example, suppose entity recognition is performed on a sentence containing 12 characters. Matching the sentence against the lexicon yields the potential words w_1, w_2 and w_3. These matched potential words are placed at the end of the sentence as candidates for the entities in the sentence, and together with the sentence they form the lattice sequence LS = {ls_1,…,ls_n}. The tokens in the Flat-Lattice are then located using their head and tail positions to restore the lattice structure information.
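The construction of the flat lattice can be sketched as follows. This is a simplified illustration, not the authors' implementation: the brute-force substring matching, the `max_word_len` limit, the 0-based positions and the toy lexicon are all assumptions.

```python
def build_flat_lattice(chars, lexicon, max_word_len=4):
    """Return a list of (token, head, tail) triples.

    Characters come first, with head == tail == their own index;
    lexicon words matched in the sentence are appended afterwards,
    with head/tail set to the indices of their first/last character.
    """
    tokens = [(c, i, i) for i, c in enumerate(chars)]
    n = len(chars)
    for start in range(n):
        # Candidate words of length 2 .. max_word_len starting at `start`.
        for end in range(start + 2, min(n, start + max_word_len) + 1):
            word = "".join(chars[start:end])
            if word in lexicon:
                tokens.append((word, start, end - 1))
    return tokens


# Hypothetical sentence and lexicon.
chars = list("长征五号火箭")
lexicon = {"长征", "长征五号", "火箭"}
lattice = build_flat_lattice(chars, lexicon)
for token, head, tail in lattice:
    print(token, head, tail)
```

Because every token carries its own head and tail position, the original lattice can be recovered from the flat sequence even though the matched words sit after the characters.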
Next, the Flat-Lattice sequence must be converted to a sequence embedding and encoded by position. The lattice sequence embedding LE = {le_1,…,le_n} is obtained by looking up LS in a pretrained embedding table. The positional embeddings are calculated using Eqs. (1)-(7), where R_ij in Eq. (1) is the relative position encoding between token i and token j, ⊕ is the concatenation operator and W_P is a learnable parameter. P_{d_ij^hh} is the encoding of the relative distance between the head positions of token i and token j; P_{d_ij^ht}, P_{d_ij^th} and P_{d_ij^tt} have analogous meanings. Each is calculated with the same formula as the position encoding in Transformer, as shown for P_{d_ij^hh} in Eqs. (2)-(3), where k is the index of the position encoding dimension and d_emb is the position encoding dimension. d_ij^hh is the distance between the head positions of token i and token j; d_ij^ht, d_ij^th and d_ij^tt have analogous meanings. They are calculated by Eqs. (4)-(7), where head[i] denotes the head position of token i and tail[j] denotes the tail position of token j.
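The quantities described above can be sketched in a few lines. The sign convention (position of token i minus position of token j) follows the FLAT formulation and is an assumption here, as are the function names and the interleaved sine/cosine layout; the learnable projection W_P applied to the concatenated encodings is omitted.

```python
import math


def relative_distances(tok_i, tok_j):
    """Four head/tail distances between two tokens given as (head, tail)."""
    (head_i, tail_i), (head_j, tail_j) = tok_i, tok_j
    return {
        "hh": head_i - head_j,
        "ht": head_i - tail_j,
        "th": tail_i - head_j,
        "tt": tail_i - tail_j,
    }


def sinusoid(d, d_emb):
    """Transformer-style sinusoidal encoding of a (possibly negative) distance d."""
    vec = []
    for k in range(d_emb // 2):
        angle = d / (10000 ** (2 * k / d_emb))
        vec.extend([math.sin(angle), math.cos(angle)])  # dims 2k and 2k+1
    return vec


def relative_position_input(tok_i, tok_j, d_emb):
    """Concatenate the four distance encodings (the input to W_P)."""
    d = relative_distances(tok_i, tok_j)
    return sum((sinusoid(d[key], d_emb) for key in ("hh", "ht", "th", "tt")), [])


# A word spanning positions 0..3 versus a word spanning positions 4..5.
dist = relative_distances((0, 3), (4, 5))
features = relative_position_input((0, 3), (4, 5), d_emb=8)
print(dist, len(features))
```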

Radical feature module
In Chinese, some characters such as "river" (河), "lake" (湖) and "sweat" (汗) are related to water, so they all contain the same radical (氵). The radicals in Chinese characters are similar to roots and affixes in English. Because Chinese characters evolved from pictographs, their radicals carry a lot of semantic information. To use this information to enhance the semantics of sentences, the Radical Feature Module splits each Chinese character into multiple radicals using a radical dictionary and encodes these radicals with 1D-CNN to obtain the radical encoding of the corresponding character.
Take the radical encoding of a sentence containing 12 characters as an example. Each character in the sentence is looked up in the radical dictionary to obtain its group of radicals. If the character with the most radicals contains 3 radicals, the kernel size of the 1D-CNN is set to 3 and the stride is also 3. The remaining characters, which have fewer than 3 radicals, are filled with "<PAD>", a symbol used exclusively for padding in deep learning. The convolution process is shown in Fig. 6. In this way, we obtain the corresponding radical embedding sequence RE = {re_1,…,re_n} for the sentence.
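The padding and strided convolution can be sketched with scalars standing in for embedding vectors. The radical decompositions, the toy scalar embedding table and the fixed kernel weights below are all hypothetical; in the actual model the embeddings are vectors and the kernel is learned.

```python
PAD = "<PAD>"


def pad_radicals(radical_groups, width):
    """Right-pad every radical group to the same width with <PAD>."""
    return [group + [PAD] * (width - len(group)) for group in radical_groups]


def conv1d(seq, kernel, stride):
    """Plain 1D convolution (cross-correlation) without padding."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, seq[s:s + k]))
        for s in range(0, len(seq) - k + 1, stride)
    ]


# Hypothetical decompositions for three water-related characters.
radical_groups = [["氵", "可"], ["氵", "古", "月"], ["氵", "干"]]
width = max(len(g) for g in radical_groups)          # 3 in this example
padded = pad_radicals(radical_groups, width)

# Toy scalar "embeddings"; <PAD> maps to zero.
toy_embedding = {PAD: 0.0, "氵": 1.0, "可": 2.0, "古": 3.0, "月": 4.0, "干": 5.0}
flat = [toy_embedding[r] for group in padded for r in group]

# Kernel size == stride == width, so each output covers exactly one character.
kernel = [1.0, 0.5, 0.5]
radical_features = conv1d(flat, kernel, stride=width)
print(radical_features)
```

Setting the stride equal to the kernel size means the convolution windows never straddle two characters, which is why one output is produced per character.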
The Radical Feature Module only extracts radical features for the characters in the input sentence; radical features of the potential words in the sentence are not extracted. As a result, LE and RE have different lengths. To facilitate their subsequent fusion, the Radical Feature Module also uses "<PAD>" to fill the radical sequence embedding RE, so that the lengths of LE and RE are the same.
Finally, the lattice sequence embedding and the radical feature sequence embedding are concatenated to obtain the sequence embedding E = {e_1,…,e_n}, as shown in Eqs. (8) and (9).
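A minimal sketch of the length alignment and concatenation follows. The function name `fuse`, the toy dimensions and the use of a zero vector as the padding embedding are assumptions made for illustration.

```python
def fuse(le, re_seq, pad_vector):
    """Pad RE to the length of LE, then concatenate token by token."""
    padded = re_seq + [pad_vector] * (len(le) - len(re_seq))
    return [l_vec + r_vec for l_vec, r_vec in zip(le, padded)]


# Two characters followed by one matched word (lattice embedding, dim 2).
le = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Radical embeddings exist only for the two characters (dim 1).
re_seq = [[1.0], [2.0]]

e = fuse(le, re_seq, pad_vector=[0.0])
print(e)
```

Concatenation keeps the two feature spaces separate and increases the embedding dimension, whereas element-wise addition would require LE and RE to have the same dimension.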

DFMS encoder module
The DFMS encoder contains two kinds of neural networks: a self-attention network and a feed-forward network. The structure is shown in Fig. 7. The self-attention network is the same as that in Transformer_XL, which uses relative position encoding to improve the model's ability to capture long-term dependencies. DFMS has a double feed-forward network, with the self-attention network placed between the two FFNs. This structure has proven effective in Conformer 18. Residual connections and normalization are also applied between the layers.
We use the sequence embedding E, which fuses LE and RE, as the input to the DFMS encoder, as shown in Eqs. (10)-(14). First, E passes through the first FFN; after a residual connection and layer normalization, the output enters the self-attention network for self-attention encoding. The encoded result again passes through a residual connection and layer normalization. Finally, after the second FFN, the final encoding of the encoder is obtained. In these equations, W_q, W_k, W_v and W_R are the query mapping matrix, the key mapping matrix, the value mapping matrix and the position mapping matrix respectively, all of which are learnable parameters. With W_q, W_k and W_v, the sequence embedding E is mapped to the query matrix Q, the key matrix K and the value matrix V respectively. In Eq. (12), u and v are also learnable parameters that ensure the attention bias of the query vector remains constant across tokens.
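The order of operations in the encoder can be sketched as below. This shows only the wiring, not the authors' implementation: the weight-free ReLU stand-in for the FFN, the uniform-averaging stand-in for relative-position self-attention, and the add-and-norm after the second FFN are all assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_norm(seq, sub_out):
    """Residual connection followed by layer normalisation, per token."""
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(seq, sub_out)]


def dfms_block(seq, ffn1, attention, ffn2):
    """FFN -> add&norm -> self-attention -> add&norm -> FFN -> add&norm."""
    h = add_norm(seq, [ffn1(x) for x in seq])      # first feed-forward layer
    h = add_norm(h, attention(h))                  # attention sees the whole sequence
    return add_norm(h, [ffn2(x) for x in h])       # second feed-forward layer


def relu_ffn(x):
    """Stand-in FFN with no weights (hypothetical)."""
    return [max(0.0, v) for v in x]


def mean_attention(seq):
    """Stand-in attention: every token attends uniformly to all tokens."""
    dim = len(seq[0])
    avg = [sum(x[d] for x in seq) / len(seq) for d in range(dim)]
    return [avg[:] for _ in seq]


E = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.0, -1.0, 2.0]]
encoded = dfms_block(E, relu_ffn, mean_attention, relu_ffn)
print(encoded)
```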

CRF decoder module
Conditional random fields are often used in machine-learning-based named entity recognition methods. Because of their strong performance, CRFs are commonly used as the decoder in neural-network-based named entity recognition models. A CRF is a conditional probability distribution model that can be used to solve prediction problems. Cuong et al. propose that CRF can be used to solve the labeling problem and to derive the most sensible label in conjunction with the semantics 19. In the named entity recognition task, CRF takes the input sequence of observations as the set of random variables X and the output sequence of labels as Y. As shown in Eqs. (15)-(16), for a sequence X = {x_1,…,x_n}, the corresponding label sequence is Y = {y_1,…,y_n}, and the probability of Y is P,
where t_k is the transition feature function and s_l is the state feature function, each taking the value 1 or 0. λ_k and u_l are the corresponding weight coefficients, which are learnable parameters.

Experiment
In the experiments of this paper, we compare the performance of the MFT model with some mainstream named entity recognition models on our aerospace dataset.
In addition, to verify whether the MFT model is effective only on our aerospace dataset, we also conduct performance comparison experiments on commonly used public named entity recognition datasets, namely the Weibo and Resume datasets. Finally, we conduct an effectiveness study on the MFT model to verify the effectiveness of our model structure.

Evaluation indicator
Common evaluation criteria used in Named Entity Recognition tasks are precision (P), recall (R) and F1 score (F1). They are calculated using formulas (17)-(19). Precision is the proportion of the labels predicted by the model that are correct. Recall is the proportion of the true labels that the model predicts correctly. Because precision and recall often trade off against each other, a combined metric, the F1 score, is also needed to judge the recognition performance of the model.
Here TP denotes a positive sample that is correctly predicted, FP denotes a negative sample that is wrongly predicted as positive, FN denotes a positive sample that is wrongly predicted as negative and TN denotes a negative sample that is correctly predicted.
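The three metrics follow directly from these counts: P = TP / (TP + FP), R = TP / (TP + FN) and F1 = 2PR / (P + R). The sketch below computes them; the function name, the zero-division handling and the example counts are illustrative assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correct predictions, 2 spurious, 8 missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

TN does not appear in any of the three formulas, which is why these metrics suit NER, where the vast majority of tokens are non-entities.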

Dataset
In this paper, we constructed an Aerospace Named Entity Recognition dataset with data from Wikipedia and the China Aerospace News website. It contains 30 k sentences and 53,788 entities. We predefined seven entity types based on the contents of the data. The data were labeled by six annotators who divided the work among themselves, and the results were confirmed and validated by a manager. The whole labeling process took about one month. We divide the dataset in the ratio 8:1:1 to obtain the training, development and test sets used for training and testing our model. The dataset information is shown in Table 1. We also used two mainstream Chinese NER datasets, the Weibo dataset 20,21 and the Resume dataset 1. The corpus of the Weibo dataset is mainly drawn from social media and contains four types of entities: Person, Location, Organization and Geo-political. The corpus of the Resume dataset is mainly from Sina Finance and was made by manually labeling named entities with the YEDDA system. Table 2 shows the main information of both datasets.

Experimental environment and parameters
In our experiments, we used the same word lexicon and pre-trained character and word embeddings as Lattice-LSTM. The radical lexicon comes from https://github.com/kfcd/chaizi. The code for all comparison models is provided by the original authors. Our model was trained on an Ubuntu system using an RTX 3060. The hyperparameters are set differently for different datasets; the hyperparameter settings for MFT are shown in Table 3. On the aerospace dataset, MFT has 9,765,018 trainable parameters. On the Resume dataset, MFT has 9,319,506 trainable parameters.

Experimental results
In this study, we use the F1 score as the criterion for judging model performance, so the reported precision and recall of each model are those achieved at the point where it reaches its highest F1 score on the test set.

Aerospace dataset
The experimental results of MFT on the aerospace dataset are shown in Table 4. The results indicate that MFT performs well: compared to the baseline model FLAT, its F1 score improves by 0.97%, its recall by 0.77% and its precision by 1.16%. LR-CNN and LGN performed worse on the aerospace dataset than on the other datasets, while the LSTM combined with Lattice achieved an F1 score of 71.33%, which is 9.88% lower than our MFT model.
Adopting the pre-trained model BERT in MFT results in a substantial improvement on every metric. Although MFT + BERT does not perform as well as FLAT + BERT in terms of recall, it achieves better F1 and precision.
Figure 8 shows the F1 curve of each model during training on the aerospace dataset. The improvement of MFT in F1 score is obvious compared to LGN, Lattice-LSTM and LR-CNN. Compared to FLAT, the F1 score of MFT improves faster in the early stage of training. The precision curves in Fig. 9 show that MFT performs much better than FLAT in terms of precision during training, and after the 100th epoch MFT's precision curve is higher than FLAT's almost everywhere. However, the recall curves in Fig. 10 show little difference between MFT and FLAT, so the improvement in the overall metric, the F1 score, mainly comes from the improvement in recognition precision.
Table 5 shows the recognition results of MFT for different classes of entities on the aerospace dataset. The best recognized entity type is ATAE, with an F1 score of 83.48%, followed by TOAV. The worst recognized type is AASLS, with an F1 score of 61.11%; AASLS also has the fewest entities. This suggests that the recognition effectiveness of the model is related to the amount of data.

Weibo dataset
Table 6 shows the experimental results of MFT on the Weibo dataset. Compared with the other models, MFT shows a larger performance improvement, with an F1 score of 64.38%. LR-CNN has the best precision, but its recall is 15.03% lower than that of MFT and its F1 score is 7.84% lower. The overall performance of the model improves further when MFT uses the pre-trained BERT model.

Resume dataset
The experimental results of MFT on the Resume dataset are shown in Table 7. The experiments demonstrate that the Double Feed-forward Neural Network and the radical information of Chinese characters do bring performance improvements to the model, with an F1 score of 95.78%, precision of 96.05% and recall of 95.52%, all of which are better than those of the other models.

Experiments of feature fusion method
To study how different methods of fusing LE and RE affect the MFT model, we conducted experiments with MFT on all three datasets. The experimental results are shown in Table 8. On the Weibo dataset and the Resume dataset, concatenating LE and RE performed better than adding them together. On the aerospace dataset, in contrast, concatenating LE and RE lowers precision, although it still outperforms FLAT, while the F1 and recall of MFT improve, with recall rising by 0.98%.

Experiments of FFN
The Conformer, which was designed for speech recognition, contains a double half-step FFN, while MFT contains a double full-step FFN. To verify whether the double full-step FFN brings more performance improvement than the double half-step FFN in the named entity recognition task, we set up experiments on the impact of different FFN weight connection methods on model performance. The experimental results are shown in Table 9. Compared to the double half-step FFN, the double full-step FFN is more suitable for the Named Entity Recognition task.
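The difference between the two connection methods can be sketched as a weight on the FFN output in the residual connection. In Conformer the FFN output is scaled by one half before being added to its input; reading "full-step" as a weight of 1.0 is an interpretation made for this sketch, and the stand-in FFN is hypothetical.

```python
def ffn_residual(x, ffn, step):
    """Residual connection with the FFN output scaled by `step`."""
    return [xi + step * fi for xi, fi in zip(x, ffn(x))]


def toy_ffn(x):
    """Stand-in FFN that simply doubles its input (hypothetical)."""
    return [2.0 * v for v in x]


x = [1.0, -1.0]
half_step = ffn_residual(x, toy_ffn, step=0.5)   # Conformer-style half-step
full_step = ffn_residual(x, toy_ffn, step=1.0)   # full-step, as assumed for MFT
print(half_step, full_step)
```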

Effectiveness study
The MFT model has two main improvements: the radical information of Chinese characters is added to enhance the semantics, and a double FFN is used to improve the feature encoding capability of the model. To verify whether both improvements bring performance benefits to MFT, we ablate the model structure and conduct experiments on each of the three datasets. As shown in Table 10, when we removed the Double FFN from MFT, the F1 scores dropped by 0.47%, 0.5% and 0.19% on the Aerospace, Weibo and Resume datasets, respectively. When we then also removed the Radical Feature Module, reverting to FLAT, the F1 scores dropped by a further 0.5%, 3.56% and 0.14%, respectively. The results in Table 10 show that both improvements in MFT are effective. The effect of the radical feature on the attention of the model is intuitive, as can be seen in Fig. 11: FLAT has a more focused attention score, while MFT adds extra attention on top of what FLAT attends to, without distracting attention from key information. This allows MFT to converge faster than FLAT during training; as shown in Fig. 12, the loss curve of MFT is lower and decreases faster than that of FLAT.

Conclusions
In this paper, we propose an Aerospace Named Entity Recognition method based on a multi-feature fusion Transformer. Large-scale text data from Wikipedia and China Aerospace News are obtained as corpus by crawlers, and the aerospace dataset is produced by manual labelling. We train and test MFT on our dataset, and the experimental results demonstrate that our model performs well, because the radical features of Chinese characters and the double Feed-forward Neural Network both boost the recognition rate of MFT.
In future work, a wider range of Chinese features, such as the pronunciation and graphics of Chinese characters, could also be incorporated for a multimodal approach.However, incorporating more diverse features may introduce invalid elements or noise, which may lead to an increase in model parameters.To mitigate this problem, future work may also require filtering of features to reduce the model size and save computational costs.

Figure 2. Extracting relevant corpus from Wikipedia and Chinese space news, screening and segmentation of the corpus and labeling it in BIO format using Label Studio.

Figure 4. The flat lattice module and the radical feature module represent the embedding of the Chinese sentence respectively, and the double feed-forward multi-self-attention module encodes these embeddings, which are finally decoded by the conditional random fields to obtain the label sequence.

Figure 5. Flatten the lattice by using the head and tail positions of Chinese characters and words to record the position of each token in the lattice structure. https://doi.org/10.1038/s41598-023-50705-0

Figure 7. Double feed-forward neural networks sandwich the multi-head self-attention module, with residual connections and layer normalization between them.

Figure 8. F1 curves during training of all models on the aerospace dataset. MFT's F1 curve is essentially above FLAT's.

Figure 9. Precision curves during training of the comparison models on the aerospace dataset. MFT has a significantly higher precision curve than FLAT.

Figure 11. Visualization of attention for MFT and FLAT. MFT has a broader focus, and more semantic features are extracted by the self-attention mechanism.

Figure 12. Loss curves for MFT and FLAT. MFT converges faster than the FLAT model and has lower losses.

Table 1. The main information of the aerospace dataset.
Figure 6. Extraction of the radicals in Chinese characters using 1D convolution to obtain the radical embedding for each character.

Table 2. Main information of the Weibo and Resume datasets.

Table 4. Aerospace NER results. Significant values are in bold.

Table 5. MFT's F1 scores for different entity types.

Table 6. Weibo NER results. Significant values are in bold.

Table 7. Main results on Resume NER. Significant values are in bold.

Table 8. Results of different feature fusion methods.

Table 9. Results of different FFNs.

Table 10. Results of the ablation study.